DOCUMENT RESUME

ED 142 577                                             TM 006 399

AUTHOR        Naccarato, Richard W.; Gillmore, Gerald M.
TITLE         The Application of Generalizability Theory to a
              College-Level French Placement Exam.
INSTITUTION   Washington Univ., Seattle. Educational Assessment
              Center.
REPORT NO     EAC-76-24
PUB DATE      Sep 76
NOTE          17p.

EDRS PRICE    MF-$0.83 HC-$1.67 Plus Postage.
DESCRIPTORS   *College Placement; College Students; *French; Higher
              Education; Measurement Techniques; *Statistical
              Analysis; *Test Reliability; *Tests
IDENTIFIERS   *Generalizability Theory; Interrater Reliability

ABSTRACT 

This paper involves an application of 
generalizability theory in assessing the dependability of a foreign 
language placement exam. The French Cloze test was administered to 
students within five levels of French classes and the results were 
scored by four different raters. Three specific generalizability 
coefficients are discussed along with implications of imposing three 
additional restrictions on the method by which future data are 
collected. The results show a very high item and student by item 
variance component and little variance due to the rater component. 
For this study adequate generalizability of students' scores was 
obtained for all generalizability coefficients using half of the 
total number of items on the exam and only one rater. Results 
indicate that in future decision studies involving tests of this 
nature each student should respond to the same set of items. Also, 
sufficient reliability may be obtained using only one rater per 
student, and not all students need be rated by the same rater. 
(Author) 

The Application of Generalizability Theory to a 
College-Level French Placement Exam 



A test, using the Cloze technique, was designed by Professor Victor Hanzeli 
of the University of Washington and pilot tested on UW students during Spring 
Quarter of 1975. This test consisted of five paragraphs in the French language, 
with 80 selected words deleted. The task of the students was to fill in the 
exact missing words. (Details of the development of the test may be obtained 
from Professor Hanzeli). 

The test was administered to students in five classes of different levels 
during the final week of the quarter. The classes were numbered as follows: 
103, 201, 202, 203, and 301 with 35, 16, 24, 24, and eight students respectively. 

Each test was scored by the same four independent raters. Raters scored 
each item as follows: two points if the answer given was exactly that desired, 
one point if the answer given was a synonym, and zero points otherwise. Along 
with this scoring method, two others were possible: 1) an item could be 
considered correct only if it was scored two, and 2) an item could be considered 
correct if it was scored either one or two. The three methods yielded total 
scores which correlated very highly across all students (r > .97) and thus 
seemed unworthy of separate analyses. Arbitrarily, we chose the original method 
for all analyses to be reported here. 
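
The three scoring rules can be stated compactly in code. The following is a minimal sketch in Python; the function name `total_scores` and the sample ratings are hypothetical, and only the 0/1/2 rubric is taken from the description above.

```python
def total_scores(item_scores):
    """Return the three possible totals for one student's item ratings (each 0, 1, or 2)."""
    graded = sum(item_scores)                                  # original method: 2 exact, 1 synonym, 0 otherwise
    exact_only = sum(1 for x in item_scores if x == 2)         # alternative 1: correct only if scored two
    exact_or_synonym = sum(1 for x in item_scores if x >= 1)   # alternative 2: correct if scored one or two
    return graded, exact_only, exact_or_synonym


# Made-up ratings for six items, for illustration only.
example_ratings = [2, 1, 0, 2, 2, 1]
print(total_scores(example_ratings))
```

Because the second and third totals are simple recodings of the same ratings, a very high correlation among the three totals is not surprising.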

The purpose of this paper is to show how data collected in a design such 
as that described above can be analyzed through the use of generalizability theory 
(Cronbach et al., 1972), so as to provide information on the dependability 
(reliability) of one's measurements within a variety of specific applications. 
Cronbach et al. (1972) made a useful distinction between G-studies and D-studies. 
The former are those done for the purpose of determining the magnitude of the 
various relevant sources of variance. The latter have as their purpose the 
providing of data for decision making. Data collected in a G-study can 
also be used for D-study purposes; however, a D-study can be designed more 
efficiently if a G-study is done in advance. 

The Design of the French Cloze Exam 

In the present study the design of the G-study was a three-way completely 
crossed random effects analysis of variance design, with four raters rating 107 
students on 80 items. For our purposes 30 items, which were equally dispersed 
among the five sections, were selected from the original 80 items on the exam. 
It was necessary to select only 30 items due to processing limitations of the 
computer program; however, as we shall see, it is possible to state 
reliabilities for any number of items. 

Table 1 depicts the random effects model, where all effects are assumed 
to be sampled from infinite universes. Of course one may conceive of a fixed 
or mixed effects model in a situation such as this; however, for the sake of 
parsimony, the discussion will be limited to the random effects model. (See 
Kane and Brennan, in press, for a more detailed discussion.) In Table 1, 
students are designated S, raters R, and items I. 



Insert Table 1 about here 



The results of the analysis of variance of the data yielded by the French 
Cloze test are found in Table 2. Of particular interest are the estimated variance 
components. Relative to the others, the items and students by items components are 
very large. Raters and students by raters, on the other hand, are very small. 

Insert Table 2 about here 
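
The estimated variance components follow from setting the observed mean squares equal to their expectations in Table 1 and solving. The following is a minimal sketch in Python, assuming the mean squares of Table 2 and the sample sizes of the analysis (107 students, 4 raters, 30 items); the function and variable names are illustrative only.

```python
def variance_components(ms, n_s, n_r, n_i):
    """Solve the Table 1 expected-mean-square equations for the variance components."""
    e = ms["SRI"]                                          # residual: sigma^2(e)
    sr = (ms["SR"] - e) / n_i                              # students by raters
    si = (ms["SI"] - e) / n_r                              # students by items
    ri = (ms["RI"] - e) / n_s                              # raters by items
    s = (ms["S"] - ms["SI"] - ms["SR"] + e) / (n_r * n_i)  # students
    r = (ms["R"] - ms["RI"] - ms["SR"] + e) / (n_s * n_i)  # raters
    i = (ms["I"] - ms["RI"] - ms["SI"] + e) / (n_s * n_r)  # items
    return {"S": s, "R": r, "I": i, "SR": sr, "SI": si, "RI": ri, "SRI": e}


# Mean squares as reported in Table 2.
MEAN_SQUARES = {"S": 10.89, "R": 4.63, "I": 101.29,
                "SR": 0.12, "SI": 1.98, "RI": 0.71, "SRI": 0.08}

components = variance_components(MEAN_SQUARES, n_s=107, n_r=4, n_i=30)
for source, value in components.items():
    print(f"{source:4s} {value:.3f}")
```

Rounded to three decimals, these values agree with the variance components listed in Table 2.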



Generalizability of the French Cloze Exam 

There are three primary coefficients of generalizability of interest 
regarding the French Cloze test. The first of these, Eρ²(R,I), is the case where 
we desire to generalize results over both items and raters, considering conditions 
for both of these facets to be samples from some larger universe of conditions. 
The coefficient Eρ²(R) will stand for the case where raters are sampled from an 
infinite universe of raters, and the conditions of the item facet are assumed to 
exhaust all possible conditions of that facet, i.e., items are assumed to be 
finite. Case III will denote the situation where we are considering the items 
to be sampled from an infinite set and the raters to be a finite sample. The 
generalizability coefficient for this situation will be denoted Eρ²(I). 

Which of these three generalizability coefficients will be appropriate 
for a D-study depends upon its purpose. If one wants a score to be an estimate 
of what one would obtain responding to any infinite set of items all measuring 
ability in the French language and having the test scored by any of a large 
number of qualified raters, then Eρ²(R,I) is most appropriate. If, on the other 
hand, generalization beyond the set of items or raters used is not desired, 
then Eρ²(R) or Eρ²(I) is appropriate. (The fourth logical coefficient, with 
raters and items both finite, is not estimable from the data of this design; 
see Kane and Brennan, in press.) 

Beyond the three possible situations or cases which we may consider for 
our decision study there are also three additional restrictions we may wish to 
impose on the decision model. The first of these restrictions is to assume 
that all students will respond to the same set of items. This is different from 
the assumption that the items came from a finite set of items of a similar nature. 
Choosing to use Eρ²(R) rather than Eρ²(R,I) relates to the set of items to which 
we want to generalize. Choosing to employ restriction I relates to how we plan 
to administer the test. If all students get the same items, then the variance 
component for items does not enter into the generalizability calculation, 
because scores are based on a sum over the same items. Not employing restriction I 
implies a nested design where each student potentially receives a different set 
of items. 

Restriction II on the full model implies that we have the condition that 
all students are rated by the same raters. In this case rater variance may be 
ignored between students since it is assumed that any effect due to different 
raters has an equal effect over all students. 

Finally, the two previously mentioned restrictions can be combined into 
the situation where all students have the same raters and respond to exactly 
the same set of items (restriction III). Again, whether or not these 
restrictions are appropriate depends upon the purposes of the D-study. 

Formulation and Discussion of the Generalizability Coefficients 

This section deals with the statistical formulation of the 
generalizability coefficients Eρ²(R,I), Eρ²(R), and Eρ²(I) under the full model 
and with the three restrictions previously mentioned. The formulations are in 
accordance with the procedures for forming generalizability coefficients 
recommended by Cronbach et al. (1972, chapter 3). The various formulas are 
presented for reference in Table 3. 

Insert Table 3 about here 

Eρ²(R,I) - Generalizability Case I - Items and Raters Infinite 

If we wish to generalize the results of this study over both items and 
raters, considering each of these facets to be samples randomly drawn from 
infinite universes of "similarly defined" items and raters, the appropriate 
coefficient of generalizability is Eρ²(R,I). The "universe score" for a 
student is defined to be the expected value of his/her average score on the 
exam, taken over all possible samples of items and raters. The expected 
observed score variance for student mean scores (σ²_obs), under the full model, 
is composed of universe score variance σ²_s and error variance, and is 
represented by the formula: 

$$\sigma^2_{obs} = \sigma^2_s + \frac{1}{n_r}\sigma^2_r + \frac{1}{n_i}\sigma^2_i + \frac{1}{n_r}\sigma^2_{rs} + \frac{1}{n_i}\sigma^2_{is} + \frac{1}{n_r n_i}\sigma^2_{ri} + \frac{1}{n_r n_i}\sigma^2_e \tag{1}$$

where n_r is the number of raters and n_i is the number of items involved in the 
decision study. Since we plan to generalize our results over both items and 
raters in Case I, the universe score variance (the numerator of Eρ²(R,I)) is 
found by taking the limit of σ²_obs as n_r and n_i approach infinity. This leaves 
only σ²_s as the estimated universe score variance. The resulting generalizability 
coefficient then is: 

$$E\rho^2(R,I) = \frac{\sigma^2_s}{\sigma^2_{obs}} \tag{2}$$

the ratio of universe and observed score variance. 

The primary purpose of the generalizability study is to obtain estimates 
of the variance components (as shown in Table 2) so that we may formulate the 
appropriate, or desired, coefficient for future decision studies. If we wish 
to estimate the dependability of student scores for a future decision study, it 
is only necessary to substitute into Formula 2 the appropriate number of raters 
and items for that study after the variance components have been estimated by 
the G-study. 

The formulation of Eρ²(R,I) changes if we wish to impose one of the 
restrictions, previously mentioned, on the design of the decision study. Table 3 
describes the effects of each of the three restrictions upon the appropriate 
generalizability coefficient. Recall that choosing to employ one of these 
restrictions relates to how we plan to carry out the future decision study, and 
is part of the decision model. Which variance components are to be included in 
the calculation of the generalizability coefficient depends strictly on the 
purpose and design of the D-study. 
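
The substitution into Formula 2, with and without the restrictions of Table 3, can be sketched as follows. This is an illustrative Python sketch, not the authors' program: the names `VC`, `observed_variance`, and `g_case1` are hypothetical, and the components are the rounded values of Table 2.

```python
# Estimated variance components from Table 2, rounded to three decimals.
VC = {"s": 0.074, "r": 0.001, "i": 0.231, "rs": 0.001,
      "is": 0.475, "ri": 0.006, "e": 0.080}


def observed_variance(vc, n_r, n_i, restriction=0):
    """Formula 1, minus the terms Table 3 eliminates under restriction 1, 2, or 3."""
    terms = {
        "s": vc["s"],
        "r": vc["r"] / n_r,
        "i": vc["i"] / n_i,
        "rs": vc["rs"] / n_r,
        "is": vc["is"] / n_i,
        "ri": vc["ri"] / (n_r * n_i),
        "e": vc["e"] / (n_r * n_i),
    }
    # Restriction 0 denotes the full model (nothing eliminated).
    dropped = {0: (), 1: ("i",), 2: ("r",), 3: ("i", "r", "ri")}[restriction]
    return sum(value for name, value in terms.items() if name not in dropped)


def g_case1(vc, n_r, n_i, restriction=0):
    """Formula 2: universe score variance divided by expected observed score variance."""
    return vc["s"] / observed_variance(vc, n_r, n_i, restriction)


# D-study with one rater and 40 items, under the full model and each restriction.
for restriction in range(4):
    print(restriction, round(g_case1(VC, n_r=1, n_i=40, restriction=restriction), 3))
```

With one rater and 40 items, the four values fall within about .002 of the corresponding Case I entries of Table 4 (.771, .822, .779, .831); small differences are to be expected because the components used here are rounded.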

Eρ²(R) - Generalizability Case II - Items Finite and Raters Infinite 

Eρ²(R) is the coefficient we would employ if the desire is to generalize 
the results of this exam over raters, but not over items. The universe score 
for this coefficient is the expected value of a student's average score on the 
exam, as given by a random sample of raters from the domain of raters, using a 
finite set of these items in the D-study. With this coefficient we do not 
consider the items to be a sample from any larger set of items, and wish only to 
generalize our results for these items or some subset of them. Universe score 
variance is now equal to the observed score variance of Formula 1, as the number 
of raters (n_r) approaches infinity. The generalizability coefficient for this 
case is then given by: 

$$E\rho^2(R) = \frac{\sigma^2_s + \frac{1}{n_i}\sigma^2_i + \frac{1}{n_i}\sigma^2_{is}}{\sigma^2_{obs}} \tag{3}$$

In Case I, where the universe score was defined as an expected value over 
an infinite set of items, the variance component (1/n_i)σ²_is was considered to be 
error variance. Now, since the items in the exam are assumed to exhaust the 
universe of items, we cannot consider the sampling of student by item 
interactions as being error. The differential response of students to the items 
is now legitimately a part of the universe score (as is item variance). 

Eρ²(I) - Generalizability Case III - Items Infinite and Raters Finite 

The third generalizability coefficient, Eρ²(I), is obtained if we desire 
to generalize the results of the exam over items, but not beyond the finite 
number of raters in the G-study. This coefficient is basically a measure of 
internal consistency of the items on the French Cloze exam. Eρ²(I) is 
approximately equal to the expected correlation of any two measures of student 
performance on the items, based on an independent sample of items from the domain 
of items, and a common sample of raters from this finite set of raters. The 
universe score for Eρ²(I) is defined as the expected value of the average 
student score on the exam, as given by a finite number of raters, using a random 
sample of items. Universe score variance is now composed of student variance, 
rater variance, and student by rater variance. The remaining four terms of the 
observed score variance (σ²_obs) are considered differentiated error variance. 
The formulation of this third coefficient is as follows: 

$$E\rho^2(I) = \frac{\sigma^2_s + \frac{1}{n_r}\sigma^2_r + \frac{1}{n_r}\sigma^2_{rs}}{\sigma^2_{obs}} \tag{4}$$
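
Formulas 3 and 4 differ from Formula 2 only in the numerator. The sketch below, in Python, codes the three full-model coefficients as given in Formulas 2 through 4, using the rounded components of Table 2. The function names are hypothetical, and no attempt is made here to reproduce the entries of Table 4; the sketch is meant only to show how the choice of universe changes the numerator.

```python
# Estimated variance components from Table 2, rounded to three decimals.
VC = {"s": 0.074, "r": 0.001, "i": 0.231, "rs": 0.001,
      "is": 0.475, "ri": 0.006, "e": 0.080}


def observed_variance(vc, n_r, n_i):
    """Formula 1: expected observed score variance under the full model."""
    return (vc["s"] + vc["r"] / n_r + vc["i"] / n_i + vc["rs"] / n_r
            + vc["is"] / n_i + (vc["ri"] + vc["e"]) / (n_r * n_i))


def g_items_and_raters(vc, n_r, n_i):
    """Formula 2 (Case I): only student variance is universe score variance."""
    return vc["s"] / observed_variance(vc, n_r, n_i)


def g_raters_only(vc, n_r, n_i):
    """Formula 3 (Case II): item and student-by-item terms join the numerator."""
    numerator = vc["s"] + vc["i"] / n_i + vc["is"] / n_i
    return numerator / observed_variance(vc, n_r, n_i)


def g_items_only(vc, n_r, n_i):
    """Formula 4 (Case III): rater and student-by-rater terms join the numerator."""
    numerator = vc["s"] + vc["r"] / n_r + vc["rs"] / n_r
    return numerator / observed_variance(vc, n_r, n_i)


for name, func in [("Case I", g_items_and_raters),
                   ("Case II", g_raters_only),
                   ("Case III", g_items_only)]:
    print(name, round(func(VC, n_r=1, n_i=40), 3))
```

Because the item-related components are large and the rater-related components are small, the Case II value computed this way is much larger than the Case I value, while the Case III value is only slightly larger.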



Results and Discussion 

Values for the three generalizability coefficients from the French Cloze 
test under the full model and the three restrictions are found in Table 4. 
For illustrative purposes, we have chosen four combinations: one item, one rater; 
40 items, one rater; 80 items, one rater; and 80 items, four raters. 



Insert Table 4 about here 

Perusal of Table 4 reveals several important relationships, especially in 
the context of efficiently designing future D-studies. First notice that 
Eρ²(R,I) is very close in magnitude to comparable values of Eρ²(I). This is 
a direct result of the relatively small variance of the rater by student 
interaction. One implication of this is that it makes little difference whether one 
wants to treat raters as finite or infinite. Another implication is that 
reliabilities based on interitem consistency will not seriously overestimate 
Eρ²(R,I). 

Values of Eρ²(R) tend to be much larger than either Eρ²(R,I) or Eρ²(I). 
Thus, treating items as finite has profound consequences for the resulting 
generalizability coefficients. Furthermore, in most educational settings it would be 
a mistake to do so, since we are typically measuring general knowledge of a 
content rather than knowledge specific to the questions asked. Measures of 
inter-rater reliability will tend to grossly overestimate the values of Eρ²(R,I). 

Comparable values of the generalizability coefficients in the full model 
and restriction II are nearly equal, as are comparable values of the 
generalizability coefficients for restrictions I and III. These similarities are a 
direct result of the relatively small variance component of raters. However, 
the item variance component is larger and causes restrictions I and III to 
yield coefficients which are higher than those of the full model and 
restriction II. The decision-making implications of this are twofold. One does not 
need to have every rater rate every student. Students can be nested within 
raters. However, unless the number of items is great (at least 40), one should 
have every student respond to the same set of items. 

Finally, it is clear that increases in both raters and items will increase 
generalizability. However, the impact of each successive increase becomes 
progressively smaller. In the present case, adequate generalizability is obtained 
with only one rater and forty items for both Eρ²(R,I) and Eρ²(I). This is 
especially true if all students respond to the same set of items. If the 
number of items is increased to 80, generalizability exceeds .90. Increases 
in raters produce very little increase in generalizability and the number of 
raters can probably be reduced to one. 
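
The diminishing return from adding items can be illustrated numerically. The following Python sketch evaluates the full-model Case I coefficient of Formula 2 with one rater as the number of items grows in equal steps of 20; the names are hypothetical and the components are the rounded values of Table 2.

```python
# Estimated variance components from Table 2, rounded to three decimals.
VC = {"s": 0.074, "r": 0.001, "i": 0.231, "rs": 0.001,
      "is": 0.475, "ri": 0.006, "e": 0.080}


def g_case1(n_r, n_i, vc=VC):
    """Full-model Case I coefficient: Formula 2 with Formula 1 as the denominator."""
    observed = (vc["s"] + vc["r"] / n_r + vc["i"] / n_i + vc["rs"] / n_r
                + vc["is"] / n_i + (vc["ri"] + vc["e"]) / (n_r * n_i))
    return vc["s"] / observed


item_counts = [20, 40, 60, 80]
values = [g_case1(1, n_i) for n_i in item_counts]
# Gain in the coefficient from each additional block of 20 items.
gains = [later - earlier for earlier, later in zip(values, values[1:])]

for n_i, value in zip(item_counts, values):
    print(f"{n_i:3d} items, 1 rater: {value:.3f}")
print("gain per 20 added items:", [round(gain, 3) for gain in gains])
```

Each added block of 20 items raises the coefficient by less than the block before it, which is the pattern described above.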

Table 1 

Random Effects ANOVA for 3-Way Completely Crossed Design 

Source of 
variance    df                 E(ms) 

S           s-1                σ²(e) + rσ²(si) + iσ²(sr) + riσ²(s) 
R           r-1                σ²(e) + sσ²(ri) + iσ²(sr) + siσ²(r) 
I           i-1                σ²(e) + sσ²(ri) + rσ²(si) + rsσ²(i) 
SR          (s-1)(r-1)         σ²(e) + iσ²(sr) 
SI          (s-1)(i-1)         σ²(e) + rσ²(si) 
RI          (r-1)(i-1)         σ²(e) + sσ²(ri) 
SRI(e)      (r-1)(s-1)(i-1)    σ²(e) 

r = number of raters 
i = number of items 
s = number of students 



Table 2 

The Analysis of Variance Summary Table 

Source       ss         df       ms       Est. σ² 

S          1154.81      106     10.89      .074 
R            13.89        3      4.63      .001 
I          2937.49       29    101.29      .231 
SR           38.34      318       .12      .001 
SI         6090.42     3074      1.98      .475 
RI           61.98       87       .71      .006 
SRI(e)      738.55     9222       .08      .080 

Table 3 

Generalizability Formulas 

n_r = number of raters        n_i = number of items 

Case I: Items and Raters Infinite 

$$E\rho^2(R,I) = \frac{\sigma^2_s}{\sigma^2_s + \frac{1}{n_r}\sigma^2_r + \frac{1}{n_i}\sigma^2_i + \frac{1}{n_r}\sigma^2_{rs} + \frac{1}{n_i}\sigma^2_{is} + \frac{1}{n_r n_i}\sigma^2_{ri} + \frac{1}{n_r n_i}\sigma^2_e}$$

Let σ²_obs = the denominator of Eρ²(R,I). 

Case II: Items Finite and Raters Infinite 

$$E\rho^2(R) = \frac{\sigma^2_s + \frac{1}{n_i}\sigma^2_i + \frac{1}{n_i}\sigma^2_{is}}{\sigma^2_{obs}}$$

Case III: Items Infinite and Raters Finite 

$$E\rho^2(I) = \frac{\sigma^2_s + \frac{1}{n_r}\sigma^2_r + \frac{1}{n_r}\sigma^2_{rs}}{\sigma^2_{obs}}$$

Restrictions of the Full Model 

Restriction 1: All students respond to the same set of items. 
    Eliminate (1/n_i)σ²_i from universe and observed score variance. 

Restriction 2: All students are rated by the same raters. 
    Eliminate (1/n_r)σ²_r from universe and observed score variance. 

Restriction 3: All students have the same items and raters. 
    Eliminate (1/n_i)σ²_i, (1/n_r)σ²_r, and (1/(n_r n_i))σ²_ri from universe 
    and observed score variance. 

Table 4 

Generalizability Coefficients 

I = No. of Items             Full 
R = No. of Raters            Model    Res. I    Res. II    Res. III 

CASE I: Eρ²(R,I), Items Infinite, Raters Infinite 
  I=1,  R=1                  .083     .113      .003       .114 
  I=40, R=1                  .771     .822      .779       .831 
  I=80, R=1                  .860     .892      .871       .902 
  I=80, R=4                  .892     .925      .892       .925 

CASE II: Eρ²(R), Items Finite, Raters Infinite 
  I=1,  R=1                  .618     .836      .619       .845 
  I=40, R=1                  .896     .956      .905       .966 
  I=80, R=1                  .930     .964      .941       .976 
  I=80, R=4                  .964     .999      .964       .999 

CASE III: Eρ²(I), Items Infinite, Raters Finite 
  I=1,  R=1                  .084     .114      .085       .115 
  I=40, R=1                  .781     .833      .789       .843 
  I=80, R=1                  .872     .904      .882       .915 
  I=80, R=4                  .895     .929      .895       .925 